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Abstract 

The development and the implementation of a Particle-in-Cell code 
written in the Unified Parallel C (UPC) language for plasma simu- 
lations with application to astrophysics and fusion nuclear energy 
machines are presented. A simple one dimensional electrostatic 
Particle-in-Cell code has been developed first to investigate the im- 
plementation details in the UPC language, and second to study the 
UPC performance on parallel computers. The initial simulations of 
plasmas with the UPC Particle-in-Cell code and a study of parallel 
speed-up of the UPC code up to 128 cores are shown. 

General Terms Particle-in-Cell, UPC performance 

Keywords UPC Particle-in-Cell, UPC, PGAS language 

1. Introduction 

The Unified Parallel C (UPC) 1 1|, a parallel extension of the ANSI 
C, is one of the most promising Partitioned Global Address Spaces 
(PGAS) programming languages for scientific applications. The 
PGAS languages emerged as alternative to MPI |2|, OpenMP |3 | 
and POSIX threads for parallel programming of parallel comput- 
ers. Two main features make the PGAS languages perfect candi- 
dates for scientific computing on parallel machines: the Single Pro- 
cess Multiple Data (SPMD) execution model and the global ad- 
dress space. First, the SPMD execution model, such as in MPI |2|, 
guarantees performance and good scaling on parallel environments. 
Second the global address space enables the creation of shared data 
among processors making message passing among processors not 
necessary, and code implementation of scientific applications on 
parallel computers easier. 

The scientific application under study is the simulation of 
plasma by a computational method, called Particle-in-Cell, to sim- 
ulate fusion energy machines and space physics plasmas |4|. In the 
Particle-in-Cell method, a set of computational particles moves as 
real particles in a self consistent electric field calculated by solving 
the Poisson equation. Because of the use computational particles, 
the Particle-in-Cell method shares features with the other common 
simulation techniques, such as Molecular Dynamics, Monte Carlo, 
and Agent-based simulations 15J. For this reason implementation 



techniques for Particle-in-Cell codes are of large interest for other 
computational techniques also. 

Studies of the UPC performance can be divided in two cate- 
gories. A class of benchmarks focuses on study of the communi- 
cation and data movement primitives to analyze communication la- 
tencies and bandwidth 1 6 1. The second category regards to the over- 
all UPC performance of applications and mathematical kernels [Tj, 
such as the UPC implementation of the NAS parallel benchmarks 
(NPB) | 8 1. The present study belongs to this second category. 

A first Particle-in-Cell code has been written in the UPC lan- 
guage. This skeleton Particle-in-Cell code (presented in Appendix 
A) mimics very closely the parallel decomposition, the synchro- 
nization pattern, the data layout of the production code written in 
C-i~i- and MPI developed by the Authors |4|. The code was writ- 
ten to study possible development and performance issues before 
developing a major production Particle-in-Cell code in the UPC 
language. 

The paper is organized as follows. Section 2 presents the gov- 
erning equations, the numerical schemes and the Particle-in-Cell 
algorithm. Section 3 focuses on development and implementation 
issues: it analyzes the parallel work decomposition, the data layout, 
and presents how the reduction operations were completed. Sec- 
tions 4, 5 and 6 conclude the paper showing the simulation results 
regarding to a plasma instability, the parallel performance using up 
to 128 cores, and analyzes the parallel speed-up. The skeleton ver- 
sion of the UPC Particle-in-Cell code, that has been used for the 
simulations of this paper, is included in Appendix A. 

2. The Particle-in-Cell Algorithm 

In the current UPC implementation of the Particle-in-cell method, 
three approximations in the plasma model have been assumed. 
First, a one dimensional geometry is used: plasma particles are con- 
strained to move along a line. Although this geometry is unrealistic 
in many cases, it is acceptable when the plasma is characterized by 
a main direction of motion. For instance, beam interactions are in- 
herently one dimensional and are well described in the one dimen- 
sional geometry. Second an electrostatic formulation of the plasma 
is used. This means that magnetic field is not present and particles 
can be only accelerated by electric field created by charge sepa- 
ration. Third, it is assumed that only electrons move while the mo- 
tion of plasma protons is neglected: because protons are 1836 times 
bigger than electrons, they do not move considerably over the time 
period of a typical simulation and can be considered motionless. 

In the Particle-in-Cell method, plasma electrons and protons are 
mimicked by the computational particles to determine the evolution 
of the plasma system | 9 1. The trajectories are computed by solving 
the equation of motion for each particle. In order to update the 
particle velocity, the force acting on the particle must be calculated. 



In the Particle-in-Cell method, a grid is introduced and an average 
force at each grid point is calculated, instead of calculating the 
force directly from other particles (such as in Molecular Dynamics 
simulations). At each grid point the charge density is found by 
counting the particles belonging to that cell. The values of the 
electric field defined on the grid are computed by solving the 
field equation starting from the charge density. The electric field 
acting on a particle is then calculated taking the electric field of 
the cell the particle belongs to. In this way, the introduction of a 
grid leads to a computational cost of 0{Np + Ng log Ng) (where 
Np is the number of particles, Ng is the number of grid points and 
the fastest field solver requires 0{Ng log Ng) operations), avoiding 
the 0{Np) operations of calculating directly the electric field from 
other particles. 

The Particle-in-Cell method consists of an initialization and 
a computational cycle, composed by update of particle positions 
and velocities, calculation of the field on the grid and calculation 
of interaction between grid and particles (interpolation). At the 
beginning of the Particle-in-Cell simulation, the particles positions 
and velocities are sampled according to the particle distribution of 
the specific problem. For instance, a beam can be represented by a 
uniform distribution in space with all the particles having the same 
drift velocity VO. After the initialization, the computational cycle 
is repeated many times. 

2.1 Particle mover 

At each computational cycle, the trajectories of the Np computa- 
tional particles are determined by solving the equation of motion 
for each particle p: 

^''^ p=l...N,, (1) 
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where Xp,Vp are the particle position and velocity, q,m are the 
charge and mass of the particle, Ep is the electric field acting on 
the particle, and Np are the number of particles. Many integrators 
can be used to solve numerically the particle equation of motion 
(Eq[TJ. The numerical technique used in the UPC Particle-in-Cell 
is the leap-frog method |9 |, also known as Verlet algorithm in the 
context of Molecular Dynamics simulations | 5 1. In this scheme, the 
particle position is evaluated at integer time levels n, n + 1, while 
the velocity at half integer time levels n — 1/2, n + 1/2: 
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2.2 Field solver 

The field equation in the electrostatic case (no magnetic field) 
reduces to the Poisson's equation: 



f^/dx^ 



(4) 



where $ is the electrostatic potential, and p the charge density. The 
electric field can be calculated from the potential as derivative of 
the potential $: 

E^-d^/dx (5) 

The Poisson Equation |4] is numerically differenced with a second 
order accurate central difference. If the space is discretized in Ng 
cells with Ax width, each grid point g (g = 0,1, ...(Ng— 2), (Ng — 
1)) arise an algebraic equation that forms a line of the matrix of the 
linear system: 
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The Ng equations above form a linear system. A matrix-free itera- 
tive linear solver, the Conjugate Gradient (CG) 1 1 1 1, is used in the 
current implementation of the UPC code. This particular method 



was chosen because it is in use in the Author's Particle-in-Cell pro- 
duction code. Once the potential is known at each grid cell, the 
electric field on the grid Eg is computed by central difference in 
space: 

S," = -($^+i-<I>^i)/(2Ax) (7) 

2.3 Interpolation 

The field Equation [6] requires to calculate the charge density at 
each grid point. This is accomplished by counting the number 
of particles belonging to the cell, multiplying this number by the 
charge Q and dividing by the grid spacing (Q/ Ax). In the current 
implementation of the UPC Particle-in-Cell code, each particle 
contribution to the charge density is split in two parts to reduce the 
numerical noise |9 |. Each particle contributes to the charge density 
of two cells: the cell, the particle belongs to, with a weight (wi = 
(xp — Xg)/Ax) that is proportional to the distance between the 
particle and cell center positions, and the neighbor cell that receives 
the rest of the charge contribution with weight W2 = 1 — wi. The 
same mechanism is used in the particle mover stage to calculate the 
electric field Ep acting on the particle p. The electric field is taken 
as a weighted combination of the electric fields defined on the grid 
points that are closest to the particle. 

2.4 Algorithm summary 

In summary, the Particle-in-Cell algorithm is organized as follows. 
The simulation parameters (particle positions velocities, and fields) 
are initially set up and a computational cycle is repeated many 
times. At each computational cycle, the new particle positions are 
calculated by using the first part of Equation [2] and the charge 
density on the grid is calculated on the grid by counting the amount 
of charge per cell (interpolation particle to grid). Once the charge 
density is known, the electrostatic potential is calculated by solving 
the linear system of Equation[6] and the electric field is computed as 
the derivative of the electric field. Finally the new particle velocity 
is calculated by solving the second part of Equation |2] as last step 
of the computational cycle. 

3. Development of the Particle-in-Cell in UPC 

A UPC Particle-in-Cell code has been implemented. The following 
sections present the development and implementation choices. A 
skeleton version of the Particle-in-Cell code is included in this pa- 
per in Appendix A to help the reader in understanding the Particle- 
in-Cell algorithm and the development issues. 

3.1 Parallel decomposition and synchronization 

Although the Particle-in-Cell method is a rather simple algorithm 
in the basic formulation, it poses at least two parallel designing 
challenges. 

First, because particles and grid points are coupled by the two 
interpolation steps (to calculate first the charge density on the grid 
from the particles, and then the electric field on the particle from 
the grid), particles and the grid points, that are close to the parti- 
cles, need to be stored on the same process to avoid interprocess 
communication. This is achieved by dividing the simulation box in 
smaller domains, and placing the part of grid and particles belong- 
ing to that domain on the same process (Domain Decomposition). 
Once particles move from one domain defined on one process to 
the contiguous domain defined on a different process, the variables 
of these particles will be communicated to the different process. In 
the UPC Particle-in-Cell code each process has two communica- 
tion buffers, that are filled with particles exiting the left and right 
sides. After the the particle positions have been updated, contigu- 
ous domains exchange the particle buffers. Moreover, the values of 
neighboring cells at the edge of the domain will be communicated 
to calculate the potential and electric field values. 



Second, because the Particle-in-Cell in its common formulation 
is a time-driven simulation, each stage of the Particles-in-Cell has 
to be synchronized over all the concurrent processes. Barriers need 
to be added in the code to ensure all the processes reach the same 
point before advancing to the next simulation step. 

3.2 Data layout 

Different techniques have been proposed to organize particle and 
field variables in memory to obtain efficient computer codes 1 10 |. 
There are two main approaches: data can be organized in struc- 
ture of arrays (SoA), where particles positions and velocities are 
stored in arrays, or in arrays of structures (AoS), where each par- 
ticle position and velocity (and other particle data) is organized in 
a single particle structure. The first approach allow to use compiler 
auto-vectorization, while the second requires explicit vectorizing 
instructions coding. Moreover, the second approach leads to an in- 
creased cache performance. Because the main code developed by 
the Authors is based on the first approach, the UPC Particle-in-Cell 
code stores particles and field variables in arrays. This data lay- 
out results in enhanced performance of the code 1 10 |. Memory in 
the PGAS languages is logically partitioned in shared and private 
memories, in a two-level memory hierarchy fashion. Shared mem- 
ory is global and accessible by all the threads: the same memory 
address on each thread refers to the same memory location. This 
comes at the cost of an increased performance overhead due to hid- 
den memory movement and communication (on distributed mem- 
ory machine). Instead, the private memory is local in each thread: 
the same memory address on each thread refers to different mem- 
ory locations. A shared type qualifier is used to declare a variable 
shared among threads, while all the variables are implicitly local in 
absence of it. Moreover, the UPC language defines the concept of 
affinity as association between a shared variable and threads. Each 
element of data storage that contains shared objects has affinity to 
exactly one thread. The use of variables with affinity to the same 
thread improves the performance. It is important to consider affinity 
in order to avoid unnecessary data movement and communication 
(on distributed memory machines). When dealing with the Particle- 
in-Cell code, a concern is the memory layout of field quantities 
(charge density, electrostatic potential, and electric field) and parti- 
cles quantities (positions and velocities). The simulation quantities 
are defined as shared variables in the current implementation. A 
one dimensional domain decomposition has been chosen to decide 
the affinity and divide the workload among the threads: contiguous 
computational cells with the information about the charge density, 
electrostatic potential and electric field are assigned to each thread. 
An array npO keeps track of how many particles are located on 
each thread. Particle positions and velocities xpO, vpO are stored 
as a one dimensional array with fixed length, and maximum block 
size. Two communication buffers xout, vout, one for particle exit- 
ing the left side and one for the right side, save the information of 
particle moving between contiguous processes. The field quantities 
E,phi and rho are shared among processes in the same way, and 
two additional cells rho-ghost provide temporary storage during 
the interpolation stage that will be communicated between neigh- 
bor domain. 

3.3 Reducing operations 

The iterative matrix-free CG solver of the UPC Particle-in-Cell 
method requires to calculate the inner product of shared arrays, 
whose blocks have affinity with different threads. The inner prod- 
uct operation is obtained by computing locally a contribution to 
the whole inner product, and reducing by adding the local values 
over all the processes. Different solutions were possible to perform 
parallel reduction in the CG solver. In this work the BUPC collec- 
tive library 1 1 1 based on the communication library (GASNet) is 



used. In particular the function bupc mIIv -reduce mII( (TYPE, TYPE 
value, upc-opJ reductionop)) is called at the beginning and at each 
iteration of the field solver. 

4. Verification Test 

The code has been verified with the standard benchmark of the two 
stream instability |9|. The two stream instability is an important 
phenomenon occurring in space physics, in the injection systems 
for nuclear fusion machine, and in particle accelerators. In this 
problem, two electron beams flow initially in opposite directions in 
a neutralizing background of motionless protons. The two beams 
extinguish as result of the beam instability. Figure [T] shows two 
time snapshots of the phase space (x, v particles position- velocity 
space). In this plot, each particle represents a single point, whose 
the X coordinate is its position and the y coordinate is its velocity. 
The beams are moving at VO = +/ — .2/c (c is the speed of 
light in vacuum) and beam particles are uniformly distributed in 
space at time equal to zero. Therefore this distribution results in 
two lines at VO = +/ — .2/c in the phase space plot of Figure [T] 
The instability grows and mixes the two electrons beams, creating 
vortices in the phase space at time equal to 20 (time is normalized 
over the inverse of the plasma frequency). Beam particles, that were 
moving at velocity V^O = +/ — .2/c initially, move with velocity 
close to zero or with opposite direction. The UPC simulation of 
Figure [T] has been completed with four threads on four cores. The 
different particle colors in the phase space plot of Figure[T]represent 
the particles variables affinity. 

5. Performance Analysis 

5.1 Test environment 

The NERSC supercomputer Franklin| 12 | has been used for com- 
pleting the performance tests. Franklin is a Cray XT4 massively 
parallel computer with 38,128 cores. Each compute node consists 
of A 2.3 GHz single socket quad-core AMD Opteron processor 
(Budapest) and 8 GB of memory (2GB per core). The code has 
been compiled with the Berkeley UPC compiler version 2.10.2 [[l] 
using the Cray's Portals low-level communications network API. 
The number of processes is provided at compiler time using the 
flag -T=NUM for optimization purpose. 

5.2 Weak scalability tests 

The performance tests have been completed increasing the num- 
ber of process cores and keeping the workload per core fixed (fol- 
lowing the weak scaling notion). In all these simulations, 64 grid 
points, 64000 particles, and a simulation box 3.1415 long is as- 
signed to each core. A Maxwellian plasma with thermal velocity 
0.2 is initialized at beginning of the simulation, and 1000 compu- 
tational cycles have been run. Because the single domain length is 
kept fixed, approximately the same amount of particles are com- 
municated among processes in all the simulations. Moreover the 
iterative solver was forced to complete 150 iterations at each simu- 
lation. The number of iterations of the CG solver increases with the 
total number of grid cells at a given required convergence tolerance 
and it is important to ensure that each core runs the same number of 
iterations for performance test purpose. It is expected that increas- 
ing the number of cores, data movement and synchronization result 
in overhead and increased execution time. 

The relative parallel speed-up has been taken as measure of the 
parallel performance. The smallest run was performed with only 
four cores (one compute node on Franklin supercomputer), and 
thus ideal scaling for the smallest run with four threads has been 
assumed. 

The speed-up of the UPC Particle-in-Cell code is presented first 
in Table [T] and then in Figure [2] The parallel speed-up does not 
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Figure 1. Phase space (x,v) of the two stream instabiUty with the UPC code. Two electron beams counterstream with drift velocities +/ — 0.2 
at t = 0. The instability develops mixing the two beams in the phase space. The simulation has been completed using four cores. Each color 
represents the thread particles belong to. 
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Table 1. Parallel speed-up on the quad-core AMD Opteron pro- 
cessor (Budapest) from NERSC Franklin supercomputer with the 
Berkeley UPC compiler version 2.10.2 1 1 1. The parallel speed-up 
increases moderately. 
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increase considerably increasing the number of cores, and paral- 
lel efficiency degrades. A relative parallel efficiency of 26% was 
recorded in the largest run with 128 cores. Figure [2] shows that the 
parallel speed-up increases moderately with the number of cores on 
the X axis. The red line represents the ideal scaling achievable in the 
best scenario where doubling the number of cores double the speed- 
up also. The speed-up of different parts of the Particle-in-Cell algo- 
rithm, that were presented in Section 2 (particle mover, field solver, 
and interpolation) have been analyzed individually to understand 
how different parts of the algorithm scale with the number of cores 
and where possible performance bottlenecks are. Figure |3] shows 
that the parallel speed-up of the particle mover stage increases with 
the number of processes nearly ideally. Communication and syn- 
chronization results in a small performance decrease. The cause of 
the performance degradation of the overall UPC Particle-in-Cell 
can be seen clearly in the Figure |4] that shows the speed-up of 
the field solver stage. The speed-up is slightly increasing with the 
number of cores resulting in degraded parallel performance. The 
field solver performance bottleneck affects considerably the speed- 
up of the UPC Particle-in-Cell code. The interpolation step perfor- 
mance results shows an excellent speed-up in Figure[5]ln summary, 
the UPC parallel performance decreases increasing the number of 
cores, with a speed-up equal to 34 for a run with 128 cores. The 
speed-up of the UPC Particle-in-Cell code is largely affected by 
the field solver speed-up that does not scale with the number of 



Figure 2. Speed-up of the UPC Particle-in-Cell code. The red line 
shows the case of ideal scaling. 
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Figure 3. Speed-up of the particle mover stage (Section 2.1). The 
speed-up and parallel performance are nearly optimal. A small 
degradation of the performance was expected as results of data 
movement (communication of particle data between neighbor pro- 
cesses and synchronization. 
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Figure 4. Speed-up of the field solver stage (Section 2.2). In this 
case, almost no speed-up is observed increasing the number of 
cores. The performance degradation of the field solver is heavily 
reflected in the low speed-up of the UPC Particle-in-Cell method. 



eration strategy could probably be optimized. Because a large num- 
ber of reducing operations were executed at each computational cy- 
cle, it is important reduction operations are completed in the fastest 
way for a given computer architecture. In fact, reducing operations 
to calculate the inner product are completed one time at the be- 
ginning of the field solver and then twice for each iteration. Thus, 
301 reducing operations were completed at each Particle-in-Cell 
computational cycle in the test cases. On the contrary, the particle 
mover and the interpolations stages presented nearly optimal par- 
allel speed-up. Note that these two stages are not embarrassingly 
parallel problems, and high parallel performance is not guaranteed 
a priori. 

In conclusion, this study indicates the UPC language can be 
used efficiently for the development of Particle-in-Cell codes, es- 
pecially for the particle mover and interpolation stages. However, 
the field solver performance bottleneck needs to be analyzed more 
deeply and removed before proceeding to simulations with very 
high number of cores. 




cores. On the opposite, both particle mover and interpolation step 
show nearly ideal scaling. 

6. Conclusions 

A Particle-in-Cell code for plasma simulation has been developed 
and implemented in the UPC language. The numerical algorithm is 
first presented, and the implementation in UPC are discussed. The 
Particle-in-Cell code has been tested on the standard benchmark of 
the two- stream instability, and the performance of this initial imple- 
mentation on parallel computer has been reported. Weak scalability 
tests showed a relative speed-up of 17, 25, 34 using respectively 32, 
64, 128 cores. Parallel performance decreases considerably increas- 
ing the number of cores. 

An analysis of the parallel speed-up of different parts of the 
Particle-in-Cell code revealed that the performance degradation at 
high number of processes is only due to the field solver part of the 
algorithm. Two causes could have degraded the field solver per- 
formance. First, 64 grid points per process do not provide enough 
workload for one process, and the global communication (or global 
data movement) largely dominate the computation with no advan- 
tage from the parallel computation. Second, the UPC reducing op- 
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A. UPC Particle-in-Cell code 

The skeleton version of the UPC Particle-in-Cell code, that has been used for the performance tests, is presented here. 



#include <upc.h> 

#include <bupc -collective v . h> 

#iiiclude <stdio.h> 

#iiiclude <math.h> 



#define NG_PER_PROC 64 // Number of grid points per process 

#define NP_PER_PROC UPCJV[AX_BLOCK_SIZE // Maximum number of particles per process 
#define NP_PER_BUFFER 16384 // Maximum size of the communication buffer 
#define NG NG_PER_PROC*THREADS // Total number of grid points 
#define NP NP_PER_PROC*THREADS // Total number of particles 

// Convert a double to an int (for interpolation) 
#define dtoi(d) ((iiit)(d)) 

// Delete one electron from the particle array 
#define DELETE_e() npO [MYTHREAD]--; \ 
xpO[ip] = xpO[ start_particle + npO [MYTHREAD] ] ; \ 
vpO[ip] = vpO[ start_particle + npO [MYTHREAD] ] ; \ 

ip — 

// Add one particle to the particle array 

#define ADD_e() xpO [ s t ar t _p ar t i cl e + npO [MYTHREAD] ] = xpO[ip]; \ 

vpO[ start_particle + npO [MYTHREAD] ] = vpO [ ip ] ; \ 
npO [MYTHREAD] ++ 

// Add particle (exiting the left side) to the communication buffer 
#define OUT_e_LEFT() x_outLEFT [ p_outLEFT [MYTHREAD] ] [MYTHREAD] = xpO[ip]; \ 
v_outLEFT[p_outLEFT [MYTHREAD]] [MYTHREAD] = vpO[ip]; \ 
p_outLEFT [MYTHREAD]++ 

// Add particle ( exiting the right side ) to the communication buffer 
#define OUT_e_RIGHT () x_outRIGHT [p_outRIGHT [MYTHREAD] ] [MYTHREAD] = xpO[ip]; \ 
v_outRIGHT[p_outRIGHT [MYTHREAD]] [MYTHREAD] = vpO[ip]; \ 
p_outRIGHT [MYTHREAD]++ 

shared int npO [THREADS] ; // number of particles in the process 
shared [NP_PER_PROC] double xpO[NP]; // particle positions 
shared [NP_PER_PROC] double vpO[NP]; // particle velocities 

shared [NG_PER_PROC] double xg[NG]; // cell center positions 
shared [NG_PER_PROC] double rho [NG] ; // charge density 

shared double rho_ghost_right [THREADS] ; // rho ghost cell (right side) 
shared double rho _ghost_left [THREADS] ; // rho ghost cell (left side) 
shared [NG_PER_PROC] double phi [NG] ; // Potential 
shared [NG_PER_PROC] double E[NG]; // Electric field 

shared [NG_PER_PROC] double xkrylov [NG] ; // CG solver solution array 
shared [NG_PER_PROC] double r [NG] ; // CG solver residual array 

shared [NG_PER_PROC] double v[NG]; // CG solver helper array used in the CG solver 
shared [NG_PER_PROC] double z[NG]; // CG solver helper array used in the CG solver 

shared int p_outLEFT [THREADS ] ; // number of particles exiting the left side 

shared double x_outLEFT [NP_PER_BUFFER] [THREADS ] ; // positions of particles exiting the left side 

shared double v_outLEFT [NP_PER_BUFFER] [THREADS ] ; // velocities of particles exiting the left side 

shared int p_outRIGHT [THREADS ] ; // number of particles exiting the left side 

shared double x_outRIGHT [NP_PER_BUFFER] [THREADS] ; // positions of particles exiting the left side 

shared double v_outRIGHT [NP_PER_BUFFER] [THREADS] ; // velocities of particles exiting the left side 

int main() { 

// simulation parameter (each thread owns its own copy) 

int nop_init = NG_PER_PROC* 1000; // nop -init : number of particles per process 

if (nop_init > NP_PER_PROC){printf(''** nop_init too large change NP_PER_PROC \n"); return(-l);} 

int nt = 1000; // number of computational cycles 

double dt = 0.1; // time step 

double Lx = 3. 1415*THREADS; double dx - Lx/(( double )NG) ; // simulation box and grid spacing 

double tol = IE— 3; // solver tolerance 

int maxit = 150; // maximum number of iterations for the CG solver 

double VTO = 0.2; // particle thermal velocity 

double VO = 0.0; // drift velocity 

double rho_initO = 1.0; // initial density 

double qomsO = —1; // particle charge to mass ratio 



double start_x = MYTHREAD*(Lx/( ( double )THREADS) ) ; // first point of the thread domain 

double end_x = (MYTHREAD+l) *(Lx/( ( double )THREADS) ) ; // last point of the thread domain 

int start_cell = MYTHREAD*NG_PER_PROC ; // first cell index in the domain 

int end_cell = (MYTHREAD+l )*NG_PER_PROC-l; // last cell index in the domain 

int St art -particle = MYrHREAD*NP_PER_PROC ; // first particle index in the domain 

double QO = ( rho _initO *Lx /(( double )( nop _init *THREADS) )) *(qomsO/ fabs (qomsO) ) ; // particle charge 
double wl , w2, Ex; // interpolation weights and Electric field 
double dx_part = (Lx /(( double )( nop _init *THREADS+1) )) ; 
double initial _error , dotZV , c, d, t; // solver variables 
int ic , ip , it , ig , ii , isolver , counter ; // counters 

// initialize number of particles per thread 

npO [MYTHREAD] = nop_init; 

// initialize cell coordinates 

upc _f orall ( ic =0; ic<3^G;ic++; &xg [ ic ] ) {xg [ ic ] = dx/2.0 + dx*ic;} 

// initialize initial particle position 

counter = start_particle ; 

for ( ig= start_cell ; ig <= end_cell ; ig ++){ 

for (ii = 0; ii < ( npO [MYTHREAD] /NG_PER_PROC) ; ii++){ 
xpO[counter] = ig*dx + (ii + . 5 ) *(dx /( npO [MYTHREAD] /NG_PER_PROC) ) ; 
counter++ ; } } 

// initialize initial particle velocity (Muller box algorithm to generate a Maxwellian) 
double rl , r2 , r3 , w, pp ; 

for ( ip= start_particle ; ip < ( start_particle + npO [MYTHREAD] ) ; ip++){ 
do { 

rl = 2.0* rand /((double )RANDJVIAX) - 1.0; r2 = 2.0* rand ()/(( double )RANDJVIAX) - 1.0; 
w = rl *rl + r2*r2 ; 
} while ( w >= 1.0 ) ; 

w = sqrt (( — 2.0* log (w) ) /w) ; 

r3 = randO; if ( r3 /(( double )RANDJVIAX) < 0.5) vpO[ip] = VO + VTO* rl * w; 
else vpO[ip] = -VO + VTO* rl * w; } 

upc -barrier ; 

// start the computational cycle 
for ( it =0; it < nt ; it ++){ 
// update particle positions 

for ( ip= start_particle ; ip < ( start_particle + npO [MYTHREAD] ) ; ip++) xpO[ip] += vpO[ip]*dt; 
// clean the number of particles exiting left and right domain 

upc_forall(ic=0; ic<rHREADS; ic ++; &p_outLEFT [ ic ] ) {p_outLEFT [ ic ] = 0; p_outRIGHT [ ic ] = 0;} 
// //// the communication buffers 

for ( ip= start_particle ; ip < ( start_particle + npO [MYTHREAD] ) ; ip++){ 
// apply the periodic boundary conditions 
if (xpO[ip] < 0){ 

xpO[ip] += Lx; OUT_e_LEFT() ; DELETE_e(); 
} else if (xpO[ip] > Lx){ 

xpO[ip] -= Lx; OUT_e_RIGHT() ; DELETE_e(); 
} else { 

if (xpO[ip] > end_x ){ OUT_e_RIGHT() ; DELETE_e(); } 

else if (xpO[ip] < start_x){ OUT_e_LEFT() ; DELETE_e() ; } 

} 

} // end of the cycle 
upc_barrier ; 

// get particles from the right side 

for (ip = 0; ip < p_outLEFT [(MYTHREAD+l )%THREADS] ; ip++){ 

xpO[ start_particle + npO [MYTHREAD] ] = x_outLEFT [ ip ][ (MYTHREAD+l )%THREADS] ; 
vpO[ start_particle + npO [MYTHREAD] ] = v_outLEFT [ ip ][ (MYTHREAD+l )%THREADS] ; 
npO [MYTHREAD] + + ; } 

upc_barrier ; 

// get particles from the left side 

for (ip = 0; ip < p.outRIGHT [(MYTHREAD- 1 + THREADS ) %THREADS ] ; ip++){ 

xpO[ start_particle + npO [MYTHREAD] ] = x_outRIGHT [ ip ] [ (MYTHREAD-1 + THREADS ) %THREADS ] ; 

vpO[ start_particle + npO [MYTHREAD] ] = v_outRIGHT [ ip ][ (MYTHREAD-1 + THREADS ) %THREADS ] ; 

npO [MYTHREAD] + + ; } 
upc_barrier ; 

// set densities to zero 

upc_forall ( ic =0; ic<J^G;ic++; &rho[ic]) rho[ic] = 0.0; 
upc_forall(ic=0; ic<rHREADS ; ic ++; &rho_ghost_left [ ic ]) { 

rho_ghost_left [ ic ] = 0.0; rho_ghost_right [ ic ] = 0.0;} 
// add background ions 

upc _f orall ( ic =0; ic<J^G;ic++; &rho[ic]) rho[ic] += rho_initO; 
// Interpolation particle to grid 



142 for ( ip= start_particle ; ip < ( start_particle + npO [MYTHREAD] ) ; ip++){ 

143 ig = dtoi(xpO[ip]/dx+0.5) ; 

144 if(ig == start_cell) {wl = ( xg [ s t ar t _c e 1 1 ]-xpO [ ip ] ) / dx ; w2 = 1 - wl ; 

145 rho[ start_cell ]+=w2*Q0/dx; rho _ghost_left [MYTHREAD] +=wl*QO/dx;} 

146 else if (ig == ( end_cell +1) ) {wl = (xpO[ip] — xg [ end_cell ] ) /dx ; w2 = 1 — wl ; 

147 rho[end_cell] +=w2*Q0/dx; rho _ghost_right [MYTHREAD] +=wl*QO/dx;} 

148 else {wl = ( xg [ ig]-xpO [ ip ] ) /dx ; w2 = 1 - wl ; 

149 rho [ig]+=w2*Q0/dx; rho [ig -l]+=wl*QO/dx ; } 

150 } 

151 upc_barrier ; 

152 // communicate ghost cells 

153 rho [ start_cell ] += rho_ghost_right [(MYTHREAD-1 + THREADS)%THREADS] ; 

154 rho[end_cell] += rho _ghost_lef t [(MYTHREAD+1)%THREADS] ; 

155 upc_barrier ; 

156 // CG SOLVER 

157 upc_forall ( ic =0; ic^G;ic++; &xkrylov [ ic ] ) xkrylov[ic] = 0.0; 

158 upc _f orall ( ic =1 ; ic<J»JG; ic ++; &r[ic]){ r[ic] = — rho [ ic ] * dx*dx ; v[ic] = r[ic];} 

159 if (MYTHREAD==0){r [0] = 0.0; v[0] = 0.0;} // BC on Phi 

160 pp =0.0; upc _f orall ( ic =0; ic^G;ic++; &r[ic]) pp += r[ic]*r[ic]; 

161 c = bupc _allv -reduce _all ( double , pp ,UPC_ADD) ; 

162 initial _error = sqrt(c); 

163 i solver = 0; 

164 if ( initial_error > 1E-16){ 

165 while ( isolver < maxit){ 

166 if (MYTHREAD==0) z[0] = v[0]; // BC: PHI = on BC 

167 if (MYTHREAD==(THREADS-1)) z[NG-l] = v[NG-2] -2*v[NG-l]; 

168 upc_forall ( ic =1; ic<NG— 1; ic ++; &z[ic]) z[ic] = v[ic— 1] — 2*v[ic] + v[ic+l]; 

169 pp = 0.0; upc _f orall ( ic =0; ic^G;ic++; &z[ic]) pp += z[ic]*v[ic]; 

170 dotZV = bupc_allv_reduce_all (double ,pp ,UPC_ADD) ; 

171 t = c/dotZV; 

172 upc _f orall ( ic =0; ic^G;ic++; &xkrylov [ ic ] ) xkrylov[ic] += t*v[ic]; 

173 upc _f orall ( ic =0; ic^G;ic++; &xkrylov [ ic ] ) r[ic] — = t*z[ic]; 

174 pp = 0.0; upc _f orall ( ic =0; ic<3S[G; ic ++; &r[ic]) pp += r[ic]*r[ic]; 

175 d = bupc_allv_reduce_all(double ,pp,UPCJ^D) ; 

176 upc_forall ( ic =0; ic^G;ic++; &v[ic]) v[ic] = (d/c)*v[ic] + r[ic]; 

177 c = d; 

178 isolver++; 

179 } 

180 } else { } // converged at the first iteration 

181 upc _f orall ( ic =0; ic^G;ic++; &v[ic]) phi[ic] = xkrylov[ic]; 

182 upc_barrier ; 

183 // calculate the electric field 

184 upc_forall(ic=l; ic <NG- 1; ic ++; &E[ic]) E[ic] = - (phi[ic+l] - phi [ ic - 1]) /(2* dx ) ; 

185 if (MYTHREAD==0) E[0] =- (phi[l] - phi [NG-l])/(2* dx) ; // Periodic BC 

186 if (MYTHREAD==(THREADS-1)) E[NG-1] =- (phi[0] - phi [NG-2])/(2* dx) ; // Periodic BC 

187 upc_barrier ; 

188 // calculated the new particle velocity 

189 for (ip=start_particle ; ip < ( start_particle + npO [MYTHREAD] ) ; ip++){ 

190 ig = dtoi (xpO[ip ]/dx + 0.5); 

191 if (ig==0) {wl = (xg[0]-xp0[ip])/dx;w2 = 1 - wl ; Ex= w2*E[0] + wl*E[NG-l];} 

192 else if ( ig==NG) {wl = (xpO[ip] - xg [NG- 1]) /dx ; w2 = 1 - wl ; Ex = w2*E[NG-l] + wl*E[0];} 

193 else {wl = (xg [ ig]-xpO [ ip ] ) /dx ; w2 = 1 - wl ; Ex = w2*E[ig] + wl*E[ig-l];} 

194 vpO[ip] += qomsO*Ex* dt ; 

195 } 

196 } // end of the simulation 

197 upc -barrier ; 

198 return (0); 

199 } 
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